Here we use the following paths to the data:
colocation_path <- "~/Desktop/OUCRU/FacebookData/ColocationMaps/"
census_path <- "~/Desktop/OUCRU/FacebookData/CensusData/census2019.rds"
Change them accordingly if you want to run the script locally on your computer.
This document shows how to combine 3 data sets:
These three data sets provide information by district (roughly 700 in Vietnam). The difficulties come from the fact that these 3 data sets do not use exactly the same district definitions. Indeed, as the population grows over time, districts tend to split. The problems we had to deal with here are
The needed packages:
library(sf)
library(stars)
library(osmdata)
library(magrittr)
library(stringr)
library(lubridate)
library(tidyr)
library(purrr)
library(dplyr) # best to load last
library(data.table)
library(ggplot2)
Redefining the hist() function:
hist2 <- function(...) hist(..., main = NA)
Downloading the population density data from WorldPop:
setwd("~/Desktop/OUCRU/FacebookData/ColocationMarc")
download.file("ftp://ftp.worldpop.org.uk/GIS/Population/Individual_countries/VNM/Viet_Nam_100m_Population/VNM_ppp_v2b_2020_UNadj.tif",
"VNM_ppp_v2b_2020_UNadj.tif")
Loading the data:
worldpop <- read_stars("VNM_ppp_v2b_2020_UNadj.tif")
Downloading the population mobility data:
download.file("https://www.dropbox.com/s/6fl62gcuma9890f/google.rds?raw=1", "google.rds")
download.file("https://www.dropbox.com/s/uuxxjm3cgs0a4gw/apple.rds?raw=1", "apple.rds")
Loading the data:
google <- readRDS("google.rds")
apple <- readRDS("apple.rds") %>%
mutate_if(is.numeric, subtract, 100)
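The mutate_if(is.numeric, subtract, 100) call shifts Apple's mobility index, which is reported relative to a baseline of 100, to a percent change from baseline. A minimal base-R sketch of what magrittr::subtract does here (the values are made up):

```r
# Apple's mobility index uses 100 as the baseline, so subtracting 100
# turns it into a percent change relative to baseline. magrittr::subtract
# is just `-` as a function; stand-in definition and made-up values:
subtract <- function(x, y) x - y
subtract(c(100, 120, 80), 100)   # -> 0 20 -20
```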
Downloading the polygons from GADM:
download.file("https://biogeo.ucdavis.edu/data/gadm3.6/Rsf/gadm36_VNM_0_sf.rds", "gadm36_VNM_0_sf.rds")
download.file("https://biogeo.ucdavis.edu/data/gadm3.6/Rsf/gadm36_VNM_1_sf.rds", "gadm36_VNM_1_sf.rds")
download.file("https://biogeo.ucdavis.edu/data/gadm3.6/Rsf/gadm36_VNM_2_sf.rds", "gadm36_VNM_2_sf.rds")
download.file("https://biogeo.ucdavis.edu/data/gadm2.8/rds/VNM_adm2.rds" , "VNM_adm2.rds")
download.file("https://biogeo.ucdavis.edu/data/gadm2.8/rds/VNM_adm3.rds" , "VNM_adm3.rds")
Loading the polygons:
vn0 <- readRDS("gadm36_VNM_0_sf.rds") # country polygon
vn1 <- readRDS("gadm36_VNM_1_sf.rds") # provinces polygons
vn2 <- readRDS("gadm36_VNM_2_sf.rds") %>% # districts polygons
transmute(province = str_squish(NAME_1),
district = str_squish(NAME_2) %>%
str_remove_all("Thành Phố |Thị Xã |Quận "))
vn2_old <- readRDS("VNM_adm2.rds") %>% # old version of the districts polygons
st_as_sf() %>%
transmute(province = str_squish(NAME_1),
district = str_squish(NAME_2) %>%
str_remove_all("Thành Phố |Thị Xã |Quận ") %>%
str_replace("Chiêm Hoá", "Chiêm Hóa"))
vn3_old <- readRDS("VNM_adm3.rds") %>% # old version of the communes polygons
st_as_sf() %>%
transmute(province = str_squish(NAME_1),
district = str_squish(NAME_2) %>%
str_remove_all("Thành Phố |Thị Xã |Quận ") %>%
str_replace("Chiêm Hoá", "Chiêm Hóa"),
commune = str_squish(NAME_3)) %>%
arrange(province, district, commune)
Removing the commune Huæi Luông from the district Sìn Hồ:
vn2_old %<>%
filter(district == "Sìn Hồ") %>%
st_set_geometry(st_union(filter(vn3_old, district == "Sìn Hồ", commune != "Huæi Luông"))) %>%
rbind(filter(vn2_old, district != "Sìn Hồ")) %>%
arrange(province, district)
Downloading the administrative boundaries from OpenStreetMap:
con_dao <- c(106.523384, 8.622214, 106.748218, 8.782639) %>%
opq() %>%
add_osm_feature(key = "boundary", value = "administrative") %>%
osmdata_sf()
The main island is made of lines. Let’s transform them into a polygon:
main_island <- con_dao$osm_lines %>%
st_combine() %>%
st_polygonize() %>%
st_geometry() %>%
st_collection_extract("POLYGON")
The other islands are already polygons. Let’s extract and merge them with the newly created polygon of the main island:
con_dao <- con_dao$osm_polygons %>%
st_geometry() %>%
c(main_island) %>%
st_multipolygon() %>%
st_sfc(crs = st_crs(vn2))
Here are the polygons for Côn Đảo:
con_dao %>%
st_geometry() %>%
plot(col = "grey")
Let’s add the polygon of Côn Đảo to the GADM data:
vn2 %<>%
filter(province == "Bà Rịa - Vũng Tàu") %>%
head(1) %>%
mutate(district = "Côn Đảo") %>%
st_set_geometry(con_dao) %>%
rbind(vn2)
For some reason the province of Vĩnh Long is missing… We add information from Wikipedia:
census <- census_path %>%
readRDS() %>%
group_by(province, district) %>%
summarise(n = sum(n)) %>%
ungroup() %>%
mutate(province = str_squish(province) %>%
str_remove_all("Thành phố |Tỉnh ") %>%
str_replace("Đăk Lăk" , "Đắk Lắk") %>%
str_replace("Đăk Nông" , "Đắk Nông") %>%
str_replace("Khánh Hoà" , "Khánh Hòa") %>%
str_replace("Thanh Hoá" , "Thanh Hóa"),
district = str_squish(district) %>%
str_replace("Thành phố Cao Lãnh" , "Cao Lãnh (Thành phố)") %>%
str_replace("Thị xã Hồng Ngự" , "Hồng Ngự (Thị xã)") %>%
str_replace("Thị xã Kỳ Anh" , "Kỳ Anh (Thị xã)") %>%
str_replace("Thị xã Long Mỹ" , "Long Mỹ (Thị xã)") %>%
str_replace("Thị xã Cai Lậy" , "Cai Lậy (Thị xã)") %>%
str_replace("Thị xã Duyên Hải" , "Duyên Hải (Thị xã)") %>%
str_remove_all("Huyện |Huỵên |Quận |Thành phố |Thành Phô |Thành Phố |Thị xã |Thị Xã ") %>%
str_replace("Hoà Vang" , "Hòa Vang") %>%
str_replace("Ứng Hoà" , "Ứng Hòa") %>%
str_replace("Đồng Phù" , "Đồng Phú") %>%
str_replace("Đắk Glong" , "Đăk Glong") %>%
str_replace("Ia H’Drai" , "Ia H' Drai") %>%
str_replace("ý Yên" , "Ý Yên") %>%
str_replace("Bác ái" , "Bác Ái") %>%
str_replace("Phan Rang- Tháp Chàm", "Phan Rang-Tháp Chàm") %>%
str_replace("Đông Hoà" , "Đông Hòa") %>%
str_replace("Tuy Hòa" , "Tuy Hoà") %>%
str_replace("Thiệu Hoá" , "Thiệu Hóa")) %>%
bind_rows(
bind_cols(
data.frame(province = rep("Vĩnh Long", 8)),
tribble(
~district , ~n,
"Bình Tân" , 93758, # 2009
"Long Hồ" , 160537, # 2018
"Mang Thít", 103573, # 2018
"Tam Bình" , 162191, # 2003
"Trà Ôn" , 149983, # 2003
"Vũng Liêm", 176233, # 2003
"Bình Minh", 95282, # 2003
"Vĩnh Long", 200120 # 2018
)
),
tribble(
~province , ~district , ~n,
"Điện Biên", "Mường Ảng" , 37077, # 2006
"Hải Phòng", "Bạch Long Vĩ", 912, # 2018
"Phú Thọ" , "Thanh Sơn" , 187700, # 2003
"Quảng Trị", "Cồn Cỏ" , 400 # 2003
)
)
Let’s check that the names of the provinces in the GADM and census data sets are the same:
identical(sort(unique(census$province)), sort(unique(vn2$province)))
## [1] TRUE
Let’s check that the districts in the GADM and the census data are the same:
nrow(anti_join(census, vn2, c("province", "district"))) < 1
## [1] TRUE
nrow(anti_join(vn2, census, c("province", "district"))) < 1
## [1] TRUE
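anti_join(x, y, by) keeps the rows of x that have no match in y, so a row count below 1 means every district matches. A base-R sketch of the same check on toy data:

```r
# anti_join keeps rows of x with no counterpart in y; in base R this is a
# negated %in% on the pasted keys (toy data, not the real districts):
x <- data.frame(province = c("A", "A"), district = c("d1", "d2"))
y <- data.frame(province = "A",          district = "d1")
unmatched <- x[!paste(x$province, x$district) %in%
                 paste(y$province, y$district), ]
nrow(unmatched)   # 1 here (the "d2" row); 0 would mean a perfect match
```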
Let’s merge the census and GADM data sets:
vn2 %<>% left_join(census, c("province", "district"))
Let’s check that no district is duplicated:
# If true, then no district is duplicated
vn2 %>%
st_drop_geometry() %>%
as.data.table() %>%
setkey(province, district) %>%
duplicated() %>%
sum() %>%
is_less_than(1)
## [1] TRUE
The colocation data use the Bing polygons, which do not seem to be very up to date. In order to adjust for that, we need to merge the following districts in the GADM data:
Furthermore, some of the districts need to be split in two, with each part merged into a different district. That’s the case for these 3 districts:
As illustrated below:
plot_districts <- function(d1, d2, d3) {
tmp <- vn2 %>%
filter(district %in% c(d1, d2, d3)) %>%
st_geometry()
plot(tmp)
plot(worldpop, add = TRUE, main = NA)
plot(tmp, add = TRUE, col = adjustcolor("green", .2))
vn2 %>%
filter(district == d1) %>%
st_geometry() %>%
plot(add = TRUE, col = adjustcolor("red", .2))
vn2_old %>%
filter(district %in% c(d2, d3)) %>%
st_geometry() %>%
plot(add = TRUE, border = "blue", lwd = 2)
}
plot_districts("Nậm Pồ" , "Mường Chà", "Mường Nhé")
plot_districts("Lâm Bình", "Nà Hang" , "Chiêm Hóa")
plot_districts("Nậm Nhùn", "Mường Tè" , "Sìn Hồ")
The raster data show the population density from WorldPop that we’ll use to split the population of the red district between the 2 blue districts. Two functions fix the 2 above-mentioned problems. Here is the first one:
merge_districts <- function(vn, d1, d2, p) {
dst <- c(d1, d2)
tmp <- vn %>%
filter(district %in% dst, province == p) %>%
mutate(n = sum(n))
tmp %<>%
filter(district %in% d1) %>%
st_set_geometry(st_union(tmp))
vn %>%
filter(! (district %in% dst & province == p)) %>%
rbind(tmp) %>%
arrange(province, district)
}
The second function relies on this helper function:
proportion <- function(to_cut, one_district, new_vn = vn2, old_vn = vn2_old, rstr = worldpop) {
to_cut <- filter(new_vn, district == to_cut)
one_part <- st_intersection(to_cut, filter(old_vn, district == one_district))
wp0 <- rstr %>%
st_crop(to_cut) %>%
st_as_stars() %>%
unlist() %>%
sum(na.rm = TRUE)
rstr %>%
st_crop(one_part) %>%
st_as_stars() %>%
unlist() %>%
sum(na.rm = TRUE) %>%
divide_by(wp0)
}
Let’s try it:
proportion("Nậm Nhùn", "Mường Tè" , vn2, vn2_old, worldpop)
## [1] 0.6717576
proportion("Nậm Pồ" , "Mường Chà", vn2, vn2_old, worldpop)
## [1] 0.4612189
proportion("Lâm Bình", "Nà Hang" , vn2, vn2_old, worldpop)
## [1] 0.7201902
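The idea behind proportion() can be illustrated with a plain matrix standing in for the WorldPop raster (toy numbers, not the real pixel values):

```r
# Share of a district's WorldPop population falling inside one part of it:
rstr <- matrix(1:9, 3, 3)   # fake population counts per pixel
part <- rstr[, 1:2]         # pixels lying inside one of the old districts
sum(part) / sum(rstr)       # -> 21/45, i.e. about 0.467
```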
And here is the second function we needed:
merge_back_districts <- function(c2, d1, d2, d3, c1 = vn2_old, rstr = worldpop) {
dsts <- c(d1, d2, d3)
tmp <- c2 %>%
filter(district %in% dsts) %$%
setNames(n, district)
half1 <- round(proportion(d1, d2, c2, c1, rstr) * tmp[d1])
half2 <- tmp[d1] - half1
tmp[d2] <- tmp[d2] + half1
tmp[d3] <- tmp[d3] + half2
tmp <- tmp[dsts[-1]]
c1 %>%
filter(district %in% dsts[-1]) %>%
mutate(n = tmp) %>%
select(everything(), geometry) %>%
rbind(filter(c2, ! district %in% dsts)) %>%
arrange(province, district)
}
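The reallocation arithmetic inside merge_back_districts() boils down to the following (hypothetical population sizes, not the real census values):

```r
# d1's population is split between the two old districts d2 and d3
# according to the raster-based proportion p:
split_population <- function(n_d1, n_d2, n_d3, p) {
  half1 <- round(p * n_d1)   # share of d1 falling inside old d2
  half2 <- n_d1 - half1      # remainder goes to old d3
  c(d2 = n_d2 + half1, d3 = n_d3 + half2)
}
split_population(10000, 50000, 60000, 0.672)   # -> d2 = 56720, d3 = 63280
```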
Let’s now call these 2 functions to perform the merges:
vn2 %<>%
merge_districts("Kỳ Anh" , "Kỳ Anh (Thị xã)" , "Hà Tĩnh") %>%
merge_districts("Long Mỹ" , "Long Mỹ (Thị xã)" , "Hậu Giang") %>%
merge_districts("Cai Lậy" , "Cai Lậy (Thị xã)" , "Tiền Giang") %>%
merge_districts("Duyên Hải" , "Duyên Hải (Thị xã)", "Trà Vinh") %>%
merge_districts("Tân Uyên" , "Bắc Tân Uyên" , "Bình Dương") %>%
merge_districts("Bến Cát" , "Bàu Bàng" , "Bình Dương") %>%
merge_districts("Bắc Từ Liêm", "Nam Từ Liêm" , "Hà Nội") %>% # then rename to Từ Liêm
merge_districts("Mộc Hóa" , "Kiến Tường" , "Long An") %>%
merge_districts("Quỳnh Lưu" , "Hoàng Mai" , "Nghệ An") %>%
merge_districts("Quảng Trạch", "Ba Đồn" , "Quảng Bình") %>%
merge_back_districts("Nậm Pồ" , "Mường Chà", "Mường Nhé") %>%
merge_back_districts("Nậm Nhùn", "Mường Tè" , "Sìn Hồ") %>%
merge_back_districts("Lâm Bình", "Nà Hang" , "Chiêm Hóa") %>%
mutate(district = str_replace(district, "Bắc Từ Liêm", "Từ Liêm")) # here the renaming
Let’s calculate and add the areas:
vn2 %<>%
mutate(area_km2 = vn2 %>%
st_geometry() %>%
st_area() %>%
as.numeric() %>%
divide_by(1e6)) %>%
select(everything(), geometry)
head(vn2)
## Simple feature collection with 6 features and 4 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 105.0233 ymin: 10.30302 xmax: 105.5753 ymax: 10.96207
## CRS: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## province district n area_km2 geometry
## 1 An Giang An Phú 183074 225.1253 MULTIPOLYGON (((105.1216 10...
## 2 An Giang Châu Đốc 117244 104.0911 MULTIPOLYGON (((105.1387 10...
## 3 An Giang Châu Phú 255894 450.9987 MULTIPOLYGON (((105.327 10....
## 4 An Giang Châu Thành 176945 354.9705 MULTIPOLYGON (((105.3082 10...
## 5 An Giang Chợ Mới 353705 369.1722 MULTIPOLYGON (((105.4729 10...
## 6 An Giang Long Xuyên 303381 114.2415 MULTIPOLYGON (((105.4729 10...
Let’s calculate and add the population densities:
vn2 %<>%
mutate(den_km2 = n / area_km2) %>%
select(everything(), geometry)
head(vn2)
## Simple feature collection with 6 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 105.0233 ymin: 10.30302 xmax: 105.5753 ymax: 10.96207
## CRS: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## province district n area_km2 den_km2 geometry
## 1 An Giang An Phú 183074 225.1253 813.2095 MULTIPOLYGON (((105.1216 10...
## 2 An Giang Châu Đốc 117244 104.0911 1126.3596 MULTIPOLYGON (((105.1387 10...
## 3 An Giang Châu Phú 255894 450.9987 567.3941 MULTIPOLYGON (((105.327 10....
## 4 An Giang Châu Thành 176945 354.9705 498.4781 MULTIPOLYGON (((105.3082 10...
## 5 An Giang Chợ Mới 353705 369.1722 958.1031 MULTIPOLYGON (((105.4729 10...
## 6 An Giang Long Xuyên 303381 114.2415 2655.6104 MULTIPOLYGON (((105.4729 10...
The distribution of the districts’ population sizes:
hist2(vn2$n, n = 50, xlab = "population size", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 1e6, 2e5), format(seq(0, 10, 2) * 1e5, big.mark = ",", scientific = FALSE, trim = TRUE))
axis(2)
Let’s define a palette of colors:
cb <- RColorBrewer::brewer.pal(9, "YlOrBr")
color_generator <- colorRampPalette(cb)
pal <- color_generator(10)
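colorRampPalette() returns a generator function that interpolates between the given colors; a self-contained sketch without RColorBrewer (using two base-R color names):

```r
# colorRampPalette() builds a function that returns n interpolated colors:
gen <- colorRampPalette(c("lightyellow", "darkred"))
pal10 <- gen(10)
length(pal10)   # 10 hex color strings
```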
The distribution of the districts’ population sizes where all the bars are of the same area and represent one decile of the data:
hist2(vn2$n, quantile(vn2$n, seq(0, 1, le = 11)), col = pal, axes = FALSE,
xlab = "population size", ylab = "density of probability")
axis(1, seq(0, 1e6, 2e5), format(seq(0, 10, 2) * 1e5, big.mark = ",", scientific = FALSE, trim = TRUE))
axis(2)
quantile(vn2$n, seq(0, 1, le = 11))
## 0% 10% 20% 30% 40% 50% 60% 70%
## 400.0 44849.5 64843.2 85749.6 104927.4 122217.0 144226.2 163635.0
## 80% 90% 100%
## 195393.4 249490.8 1035938.0
Let’s map the population sizes of the districts:
vn2 %>%
st_geometry() %>%
plot(lwd = .1, col = pal[cut(vn2$n, quantile(vn2$n, seq(0, 1, le = 11)))], main = NA)
vn0 %>%
st_geometry() %>%
plot(add = TRUE)
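The coloring works because cut() maps each value to the index of its quantile bin, which then indexes the palette; a minimal illustration with hand-picked breaks:

```r
# cut(..., labels = FALSE) returns the bin index of each value:
x <- c(5, 50, 500)
bins <- cut(x, breaks = c(0, 10, 100, 1000), labels = FALSE)
bins   # -> 1 2 3: first, second and third bin
```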
The mean and median of the district population sizes:
mean(vn2$n)
## [1] 139978.6
median(vn2$n)
## [1] 122217
The distribution of the districts’ areas:
hist2(vn2$area_km2, n = 50,
xlab = expression(paste("area ", (km^2))), ylab = "number of districts")
Mean and median of the districts’ areas:
mean(vn2$area_km2)
## [1] 471.8696
median(vn2$area_km2)
## [1] 339.3427
Some quantiles of the districts’ densities:
(quants <- quantile(vn2$den_km2, c(.025, .25, .5, .75, .975)))
## 2.5% 25% 50% 75% 97.5%
## 34.62209 115.50101 397.09968 1019.86255 20341.46013
The distribution of the districts’ densities, on a log scale, with quantiles:
hist2(log10(as.numeric(vn2$den_km2)), n = 50, axes = FALSE,
xlab = expression(paste("density (/", km^2, ")")), ylab = "number of districts")
axis(1, 1:4, c("10", "100", "1000", "10000"))
axis(2)
abline(v = log10(quants), lty = c(3, 2, 1, 2, 3), col = "blue", lwd = 2)
Same distribution as above where all the bars have the same area representing 10% of the data:
xs <- log10(vn2$den_km2)
hist2(xs, quantile(xs, seq(0, 1, le = 11)), col = pal, axes = FALSE,
xlab = expression(paste("density (/", km^2, ")")), ylab = "density of probability")
axis(1, 1:4, c("10", "100", "1000", "10000"))
axis(2)
Mapping the districts’ population densities:
vn2 %>%
select(den_km2) %>%
st_geometry() %>%
plot(lwd = .1, col = pal[cut(vn2$den_km2, quantile(vn2$den_km2, seq(0, 1, le = 11)))], main = NA)
vn0 %>%
st_geometry() %>%
plot(add = TRUE)
The relationship between the population size and the population density:
vn2 %$%
plot(log10(n), log10(den_km2), col = "blue", axes = FALSE, xlab = "population size",
ylab = expression(paste("population density (/", km^2, ")")))
axis(1, 3:6, format(10^(3:6), big.mark = ",", scientific = FALSE, trim = TRUE))
axis(2, 1:4, format(10^(1:4), big.mark = ",", scientific = FALSE, trim = TRUE))
The relationship is close to linear on this log-log scale, meaning that the density increases with some power of the population size.
The list of colocation data files:
files <- dir(colocation_path)
There are 17 of them:
length(files)
## [1] 17
Making names from files names:
weeks <- str_remove_all(files, "^.*__|.csv")
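The regular expression strips everything up to the double underscore as well as the .csv extension; a base-R check with gsub() on a hypothetical file name:

```r
# "^.*__" removes the prefix up to "__", ".csv" removes the extension;
# the file name below is hypothetical:
f <- "vnm_colocation__2020-03-03.csv"
gsub("^.*__|.csv", "", f)   # -> "2020-03-03"
```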
Loading the colocation data into a list (one slot per week):
colocation <- paste0(colocation_path, dir(colocation_path)) %>%
map(readr::read_csv) %>%
setNames(weeks) %>%
map(select, -country, -ds) # remove the country code (useless) and the ds (which is the name of the slot)
The colocation data looks like this:
head(colocation, 1)
## $`2020-03-03`
## # A tibble: 448,665 x 13
## polygon1_id polygon1_name lon_1 lat_1 name_stack_1 fb_population_1
## <dbl> <chr> <dbl> <dbl> <chr> <dbl>
## 1 834298 Huyện Lý Sơn 109. 15.4 Quảng Ngãi … 156
## 2 834671 Huyện Đạ Tẻh 108. 11.6 Lâm Đồng Pr… 123
## 3 834269 Huyện Hương … 107. 16.4 Thừa Thiên … 850
## 4 834290 Huyện Hiệp Đ… 108. 15.5 Quảng Nam P… 206
## 5 834320 Thị Xã Bình … 107. 11.7 Bình Phước … 991
## 6 834672 Huyện Cát Ti… 107. 11.7 Lâm Đồng Pr… 109
## 7 834652 Huyện Tuy Ph… 109. 11.4 Bình Thuận … 1555
## 8 834798 Huyện Đan Ph… 106. 21.1 Hanoi City … 2640
## 9 834599 Huyện Đông H… 106. 20.5 Thái Bình P… 1819
## 10 834336 Huyện Vĩnh H… 106. 10.9 Long An Pro… 422
## # … with 448,655 more rows, and 7 more variables: polygon2_id <dbl>,
## # polygon2_name <chr>, lon_2 <dbl>, lat_2 <dbl>, name_stack_2 <chr>,
## # fb_population_2 <dbl>, link_value <dbl>
The slot names are the last day of the 7-day period over which the data are collected:
names(colocation)
## [1] "2020-03-03" "2020-03-10" "2020-03-17" "2020-03-24" "2020-03-31"
## [6] "2020-04-07" "2020-04-14" "2020-04-21" "2020-04-28" "2020-05-05"
## [11] "2020-05-12" "2020-05-19" "2020-05-26" "2020-06-02" "2020-06-09"
## [16] "2020-06-16" "2020-06-23"
Here we show what the problem is (i.e. some of the data points lie outside Vietnam). The following function plots the colocation data:
plot_fb <- function(df, xlim, ylim, col) {
plot(xlim, ylim, asp = 1, xlab = "longitude", ylab = "latitude", type = "n")
maps::map(col = "grey", fill = TRUE, add = TRUE)
points(df[[1]], df[[2]], pch = ".", col = col)
axis(1)
axis(2)
box(bty = "o")
}
Let’s consider one week:
june23 <- colocation$`2020-06-23`
The locations in this week are within these boundaries:
(xlim <- range(range(june23$lon_1), range(june23$lon_2)))
## [1] 101.5709 109.3583
(ylim <- range(range(june23$lat_1), range(june23$lat_2)))
## [1] 8.668113 23.237631
Let’s plot the polygon 1:
june23 %>%
select(lon_1, lat_1) %>%
distinct() %>%
plot_fb(xlim, ylim, "blue")
And the polygon 2:
june23 %>%
select(lon_2, lat_2) %>%
distinct() %>%
plot_fb(xlim, ylim, "red")
In the following function, df is a data frame with the same column names as a “colocation map” data frame, pl is an sf non-projected polygon, and type is either 1 or 2.
pts_in_pol <- function(type, df, pl, project = FALSE) {
# assumes that sf is not projected.
# 4326: non-projected
# 3857: pseudo-Mercator (e.g. Google Maps)
df %<>% st_as_sf(coords = paste0(c("lon_", "lat_"), type), crs = 4326)
if (project) {
df %<>% st_transform(3857)
pl %<>% st_transform(3857)
}
df %>%
st_intersects(pl) %>%
map_int(length)
}
The function returns a vector of length equal to the number of rows of df, with 1 if the point is inside the polygon pl and 0 otherwise. Let’s try it:
tmp <- pts_in_pol(1, june23, vn0)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
It takes about 3.5 minutes, and it’s about 3 times slower if we project the points and the polygon. The arguments of the following function are the same as those of the previous one. It uses the previous function to delete the records of df whose start and end coordinates are both outside pl:
rcd_in_pol <- function(df, pl, project = FALSE) {
require(magrittr)
1:2 %>%
parallel::mclapply(pts_in_pol, df, pl, project, mc.cores = 2) %>%
as.data.frame() %>%
rowSums() %>%
is_greater_than(0) %>%
magrittr::extract(df, ., ) # there is an extract() function in tidyr too
}
Let’s try it:
june23 %<>% rcd_in_pol(vn0)
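The filtering logic can be sketched without sf by replacing the country polygon with a plain bounding box (rough coordinates for Vietnam; the real code uses st_intersects on the actual polygon):

```r
# Keep a record if at least one of its two endpoints falls inside the box:
in_box <- function(lon, lat, box)
  lon >= box["xmin"] & lon <= box["xmax"] &
  lat >= box["ymin"] & lat <= box["ymax"]
box <- c(xmin = 102, xmax = 110, ymin = 8, ymax = 24)  # rough Vietnam bbox
df  <- data.frame(lon_1 = c(105, 100), lat_1 = c(10, 10),
                  lon_2 = c(106, 101), lat_2 = c(11, 11))
keep <- in_box(df$lon_1, df$lat_1, box) | in_box(df$lon_2, df$lat_2, box)
df[keep, ]   # only the first record survives
```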
Let’s process all the data (takes about 1 minute):
colocation %<>% map(rcd_in_pol, vn0)
colocation <- readRDS("colocation2.rds")
The list of districts in the colocation data:
col_names <- c("polygon_id", "polygon_name", "lon", "lat", "name_stack")
districts1 <- map_dfr(colocation, select, polygon1_id, polygon1_name, lon_1, lat_1, name_stack_1) %>%
setNames(col_names)
districts2 <- map_dfr(colocation, select, polygon2_id, polygon2_name, lon_2, lat_2, name_stack_2) %>%
setNames(col_names)
districts <- bind_rows(districts1, districts2) %>%
distinct()
This is what it looks like:
districts
## # A tibble: 693 x 5
## polygon_id polygon_name lon lat name_stack
## <dbl> <chr> <dbl> <dbl> <chr>
## 1 834298 Huyện Lý Sơn 109. 15.4 Quảng Ngãi Province // Huyện Lý Sơn
## 2 834671 Huyện Đạ Tẻh 108. 11.6 Lâm Đồng Province // Huyện Đạ Tẻh
## 3 834269 Huyện Hương Trà 107. 16.4 Thừa Thiên Huế Province // Huyện Hươ…
## 4 834290 Huyện Hiệp Đức 108. 15.5 Quảng Nam Province // Huyện Hiệp Đức
## 5 834320 Thị Xã Bình Long 107. 11.7 Bình Phước Province // Thị Xã Bình L…
## 6 834672 Huyện Cát Tiên 107. 11.7 Lâm Đồng Province // Huyện Cát Tiên
## 7 834652 Huyện Tuy Phong 109. 11.4 Bình Thuận Province // Huyện Tuy Pho…
## 8 834798 Huyện Đan Phượng 106. 21.1 Hanoi City // Huyện Đan Phượng
## 9 834599 Huyện Đông Hưng 106. 20.5 Thái Bình Province // Huyện Đông Hưng
## 10 834336 Huyện Vĩnh Hưng 106. 10.9 Long An Province // Huyện Vĩnh Hưng
## # … with 683 more rows
saveRDS(districts, file="coloc_district.RDS")
In the colocation data, the name_stack_* variables contain the names of the province and the district separated by //. The problem is that a number of districts do not have // in their name_stack variable, and all of them seem to be in the province of Hanoi:
plot(st_geometry(vn0), col = "grey")
vn1 %>%
filter(NAME_1 == "Hà Nội") %>%
st_geometry() %>%
plot(add = TRUE, col = "yellow")
districts %>%
filter(! grepl(" // ", name_stack)) %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
st_geometry() %>%
plot(add = TRUE, col = "red")
Plus a number of name fixes:
districts %<>%
separate(name_stack, c("province", "district"), " // ") %>%
mutate(indicate = is.na(district),
district = ifelse(indicate, province, district),
province = ifelse(indicate, "Hanoi City", province) %>%
str_squish() %>%
str_remove(" Province| City") %>%
str_replace("-", " - ") %>%
str_replace("Da Nang" , "Đà Nẵng") %>%
str_replace("Hanoi" , "Hà Nội") %>%
str_replace("Hai Phong" , "Hải Phòng") %>%
str_replace("Ho Chi Minh" , "Hồ Chí Minh") %>%
str_replace("Hòa Bình" , "Hoà Bình"),
polygon_name = str_squish(polygon_name) %>%
str_replace("Thành Phố Cao Lãnh", "Cao Lãnh (Thành phố)") %>%
str_replace("Thị Xã Hồng Ngự", "Hồng Ngự (Thị xã)") %>%
str_remove("Huyện |Thành phố |Thị xã |Quận |Thành Phố |Thị Xã ") %>%
str_replace("Quy Nhơn" , "Qui Nhơn") %>%
str_replace("Đảo Phú Quý" , "Phú Quí") %>%
str_replace("Bình Thủy" , "Bình Thuỷ") %>%
str_replace("Hòa An" , "Hoà An") %>%
str_replace("Phục Hòa" , "Phục Hoà") %>%
str_replace("Thái Hòa" , "Thái Hoà") %>%
str_replace("Hạ Hòa" , "Hạ Hoà") %>%
str_replace("Phú Hòa" , "Phú Hoà") %>%
str_replace("Tây Hòa" , "Tây Hoà") %>%
str_replace("Tuy Hòa" , "Tuy Hoà") %>%
str_replace("Krông Ana" , "Krông A Na") %>%
str_replace("Krông A Na" , "Krông A Na") %>% ##
str_replace("Krông Păk" , "Krông Pắc") %>%
str_replace("Krông Pắc" , "Krông Pắc") %>% ##
str_replace("Đắk Glong" , "Đăk Glong") %>%
str_replace("Đắk Rlấp" , "Đắk R'Lấp") %>%
str_replace("A Yun Pa" , "Ayun Pa") %>%
str_replace("Từ Liêm" , "Nam Từ Liêm") %>%
str_replace("Kiến Thụy" , "Kiến Thuỵ") %>%
str_replace("Thủy Nguyên" , "Thuỷ Nguyên") %>%
str_replace("Vị Thủy" , "Vị Thuỷ") %>%
str_replace("Bác Ai" , "Bác Ái") %>%
str_replace("Thanh Thủy" , "Thanh Thuỷ") %>%
str_replace("Yên Hưng" , "Quảng Yên") %>%
str_replace("Na Hang" , "Nà Hang") %>%
str_replace("Mù Cang Chải", "Mù Căng Chải") %>%
str_replace("M`Đrắk" , "M'Đrắk") %>%
str_replace("Cư M`Gar" , "Cư M'gar") %>%
str_replace("Ea H`Leo" , "Ea H'leo") %>%
str_replace("Nam Từ Liêm" , "Từ Liêm") %>%
str_replace("Buôn Hồ" , "Buôn Hồ"),
polygon_name = ifelse(province == "Bạc Liêu" & polygon_name == "Hòa Bình",
"Hoà Bình", polygon_name)) %>%
select(-indicate, -district) %>%
rename(district = polygon_name)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 4 rows [82, 210,
## 309, 593].
The warnings are produced because the province names are missing for some of the districts of Hanoi. Some districts do not have any information in the colocation data:
anti_join(districts, vn2, c("province", "district"))
## # A tibble: 0 x 5
## # … with 5 variables: polygon_id <dbl>, district <chr>, lon <dbl>, lat <dbl>,
## # province <chr>
anti_join(vn2, districts, c("province", "district"))
## Simple feature collection with 5 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 104.636 ymin: 11.60531 xmax: 107.7464 ymax: 21.04582
## CRS: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## province district n area_km2 den_km2
## 1 Bình Phước Phú Riềng 97931 667.330016 146.75048
## 2 Hải Phòng Bạch Long Vĩ 912 15.273999 59.70931
## 3 Kon Tum Ia H' Drai 6638 1159.609633 5.72434
## 4 Quảng Trị Cồn Cỏ 400 2.176718 183.76293
## 5 Sơn La Vân Hồ 61197 969.335926 63.13291
## geometry
## 1 MULTIPOLYGON (((106.8315 11...
## 2 MULTIPOLYGON (((107.7457 20...
## 3 MULTIPOLYGON (((107.4588 13...
## 4 MULTIPOLYGON (((107.343 17....
## 5 MULTIPOLYGON (((105.0134 20...
Let’s map these districts that never have any data:
vn0 %>%
st_geometry() %>%
plot(col = "grey")
tmp <- vn2 %>%
mutate(pd = paste(province, district)) %>%
filter(! pd %in% with(districts, paste(province, district)))
tmp %>%
st_geometry() %>%
plot(add = TRUE, col = "red")
tmp %>%
st_centroid() %>%
st_coordinates() %>%
as_tibble() %>%
filter(between(Y, 15, 20.5)) %>%
st_as_sf(coords = c("X", "Y")) %>%
plot(add = TRUE, col = "red")
Let’s add these 5 districts to districts:
tmp <- vn2 %>%
mutate(pd = paste(province, district)) %>%
filter(! pd %in% with(districts, paste(province, district)))
districts <- tmp %>%
st_centroid() %>%
st_coordinates() %>%
as_tibble() %>%
transmute(lon = X, lat = Y) %>%
bind_cols(tmp) %>%
mutate(polygon_id = 850001:850005) %>%
select(polygon_id, district, lon, lat, province) %>%
bind_rows(districts)
This is what it looks like now:
districts
## # A tibble: 698 x 5
## polygon_id district lon lat province
## <dbl> <chr> <dbl> <dbl> <chr>
## 1 850001 Phú Riềng 107. 11.7 Bình Phước
## 2 850002 Bạch Long Vĩ 108. 20.1 Hải Phòng
## 3 850003 Ia H' Drai 108. 14.2 Kon Tum
## 4 850004 Cồn Cỏ 107. 17.2 Quảng Trị
## 5 850005 Vân Hồ 105. 20.8 Sơn La
## 6 834298 Lý Sơn 109. 15.4 Quảng Ngãi
## 7 834671 Đạ Tẻh 108. 11.6 Lâm Đồng
## 8 834269 Hương Trà 107. 16.4 Thừa Thiên Huế
## 9 834290 Hiệp Đức 108. 15.5 Quảng Nam
## 10 834320 Bình Long 107. 11.7 Bình Phước
## # … with 688 more rows
Let’s merge this information with the polygon / census data:
vn2 %<>%
left_join(districts, c("province", "district")) %>%
select(polygon_id, province, district, n, area_km2, den_km2, lon, lat, geometry)
And this is what it looks like now:
vn2
## Simple feature collection with 698 features and 8 fields
## geometry type: GEOMETRY
## dimension: XY
## bbox: xmin: 102.145 ymin: 8.381355 xmax: 109.4692 ymax: 23.39269
## CRS: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## First 10 features:
## polygon_id province district n area_km2 den_km2 lon lat
## 1 834464 An Giang An Phú 183074 225.1253 813.2095 105.1009 10.84574
## 2 834463 An Giang Châu Đốc 117244 104.0911 1126.3596 105.0873 10.67470
## 3 834466 An Giang Châu Phú 255894 450.9987 567.3941 105.1797 10.55759
## 4 834468 An Giang Châu Thành 176945 354.9705 498.4781 105.2659 10.42124
## 5 834469 An Giang Chợ Mới 353705 369.1722 958.1031 105.4614 10.47487
## 6 834462 An Giang Long Xuyên 303381 114.2415 2655.6104 105.4280 10.37121
## 7 834465 An Giang Phú Tân 214823 311.6955 689.2079 105.2758 10.65475
## 8 834873 An Giang Tân Châu 176007 172.0770 1022.8388 105.1859 10.79910
## 9 834470 An Giang Thoại Sơn 189389 468.9065 403.8950 105.2556 10.29679
## 10 834467 An Giang Tịnh Biên 124798 354.1079 352.4293 105.0113 10.54809
## geometry
## 1 MULTIPOLYGON (((105.1216 10...
## 2 MULTIPOLYGON (((105.1387 10...
## 3 MULTIPOLYGON (((105.327 10....
## 4 MULTIPOLYGON (((105.3082 10...
## 5 MULTIPOLYGON (((105.4729 10...
## 6 MULTIPOLYGON (((105.4729 10...
## 7 MULTIPOLYGON (((105.327 10....
## 8 MULTIPOLYGON (((105.2463 10...
## 9 MULTIPOLYGON (((105.2532 10...
## 10 MULTIPOLYGON (((105.1139 10...
The following function combines the facebook data with the GADM / census data for each week:
dist_fb_populations <- function(x) {
x %>%
select(polygon1_id, fb_population_1) %>%
distinct() %>%
right_join(select(st_drop_geometry(vn2), -area_km2), c("polygon1_id" = "polygon_id"))
}
Let’s compute it:
tmp <- map(colocation, dist_fb_populations)
The facebook population of each district does not seem to change as a function of time:
xs <- ymd(names(colocation)) - 6
plot(xs, seq_along(xs), ylim = c(0, 5), type = "n")
tmp %>%
map_dfc(pull, fb_population_1) %>%
t() %>%
as.data.frame() %>%
map(log10) %>%
walk(lines, x = xs, col = adjustcolor("black", .25))
And the distribution across districts looks like this:
tmp %>%
map(mutate, prop = fb_population_1 / n) %>%
first() %>%
pull(prop) %>%
hist2(50, xlab = "facebook coverage", ylab = "number of districts")
Let’s look at the facebook coverage as a function of the population size for the first week of the colocation data:
tmp <- colocation$`2020-03-03` %>%
select(polygon1_id, fb_population_1) %>%
distinct() %>%
right_join(select(st_drop_geometry(vn2), -area_km2), c("polygon1_id" = "polygon_id"))
summary(lm(log10(fb_population_1) ~ log10(n), tmp))
##
## Call:
## lm(formula = log10(fb_population_1) ~ log10(n), data = tmp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.03536 -0.19070 -0.01065 0.18854 1.33225
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.55716 0.18196 -25.05 <2e-16 ***
## log10(n) 1.48787 0.03592 41.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2796 on 690 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.7132, Adjusted R-squared: 0.7128
## F-statistic: 1716 on 1 and 690 DF, p-value: < 2.2e-16
with(tmp, plot(log10(n), log10(fb_population_1), col = "blue", axes = FALSE,
ylim = c(1, 4.5), xlim = c(3.8, 6),
xlab = "district population", ylab = "facebook population"))
axis(1, 4:6, format(10000 * c(1, 10, 100), big.mark = ",", scientific = FALSE, trim = TRUE))
axis(2, 1:4, format(10 * c(1, 10, 100, 1000), big.mark = ",", scientific = FALSE, trim = TRUE))
Meaning that the facebook coverage increases with population size.
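With log10(fb_population) = a + b * log10(n) and a slope b of about 1.49 (from the regression above), the coverage fb_population / n scales as n^(b - 1), roughly n^0.49; a quick check of what that implies:

```r
# Coverage ratio scaling implied by the fitted slope b:
b <- 1.48787
coverage_scaling <- function(n) n^(b - 1)
coverage_scaling(1e6) / coverage_scaling(1e4)
# a district 100x more populous has roughly 9.5x the facebook coverage
```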
There are no missing values and no zeros in the link variable:
link_val <- map(colocation, pull, link_value)
link_val2 <- unlist(link_val)
any(is.na(link_val2))
## [1] FALSE
range(link_val2)
## [1] 2.716479e-12 2.276358e-01
A function that computes the number of non-null links per district:
nb_links <- function(x) {
x %>%
group_by(polygon1_id) %>%
tally() %>%
right_join(select(districts, c("polygon1_id" = "polygon_id"))) %>%
mutate(n = replace_na(n, 0)) %>%
pull(n)
}
Let’s look at the distribution of the number of non-null links per district as a function of time:
cols <- RColorBrewer::brewer.pal(9, "Blues")
ld <- ymd(c(20200401, 20200423)) # dates of the lockdown
xlim <- ymd(c(20200229, 20200701))
ys <- c(0, 700)
tmp <- colocation %>%
map(nb_links) %>%
map_dfc(quantile, 0:10 / 10) %>%
t() %>%
as.data.frame()
## Joining, by = "polygon1_id"
xs_tr <- c(xs[1], rep(xs[-1], each = 2), last(xs) + mean(diff(xs)))
xs2 <- c(xs_tr, rev(xs_tr))
plot(xs, seq_along(xs), ylim = c(0, 700), type = "n", xlim = xlim,
xlab = NA, ylab = "number of non-null links per district")
for (i in 1:5)
polygon(xs2, c(rep(tmp[[i]], each = 2), rev(rep(tmp[[12 - i]], each = 2))),
col = cols[i], border = NA)
lines(xs_tr, rep(tmp[[6]], each = 2), col = cols[6], lwd = 2)
abline(h = nrow(districts), col = "grey", lty = 2)
polygon(c(ld, rev(ld)), rep(ys, each = 2), col = adjustcolor("red", .1), border = NA)
where the red area indicates the lockdown period.
Let’s look at the sum of the probability per week:
plot(xs, map_dbl(link_val, sum), type = "s", col = "blue", xlim = xlim,
xlab = NA, ylab = "sums of probabilities", lwd = 2)
polygon(c(ld, rev(ld)), rep(ys, each = 2), col = adjustcolor("red", .25), border = NA)
The 10th week looks suspicious.
Let’s generate a template with all the combinations of districts:
template <- districts %>%
arrange(polygon_id) %$%
expand.grid(polygon_id, polygon_id) %>%
as_tibble() %>%
setNames(c("polygon1_id", "polygon2_id"))
The function that transforms the colocation data into a matrix:
to_matrix <- function(df, template) {
dim_names <- sort(unique(template$polygon1_id))
df %>%
select(polygon1_id, polygon2_id, link_value) %>%
left_join(template, ., c("polygon1_id", "polygon2_id")) %>%
mutate(link_value = replace_na(link_value, 0)) %>%
pull(link_value) %>%
matrix(nrow(districts)) %>%
`colnames<-`(dim_names) %>%
`rownames<-`(dim_names)
}
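The template + `left_join()` + `matrix()` trick can be checked on a toy example (3 hypothetical districts, 2 observed links; all values made up):

```r
library(dplyr)
library(tidyr)

ids <- 1:3
toy_template <- expand.grid(polygon1_id = ids, polygon2_id = ids)
toy_links <- tibble(polygon1_id = c(1, 2), polygon2_id = c(2, 1),
                    link_value = c(.5, .5))

m <- toy_template %>%
  left_join(toy_links, by = c("polygon1_id", "polygon2_id")) %>%
  mutate(link_value = replace_na(link_value, 0)) %>%  # unobserved pairs become zeros
  pull(link_value) %>%
  matrix(length(ids), dimnames = list(ids, ids))      # filled column-wise
m
```

Because `expand.grid()` varies its first argument fastest and `matrix()` fills column-wise, rows end up indexed by `polygon1_id` and columns by `polygon2_id`.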
Let’s do it for all the weeks and sum them:
coloc_mat <- colocation %>%
map(to_matrix, template) %>%
reduce(`+`)
Let’s have a look at this matrix. Let’s first order the district from south to north and from west to east:
hash <- setNames(seq_along(colnames(coloc_mat)),
colnames(coloc_mat))
ind <- districts %>%
arrange(lat, lon) %>%
pull(polygon_id) %>%
as.character() %>%
magrittr::extract(hash, .)
coloc_mat <- coloc_mat[ind, ind]
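The named-vector `hash` trick is worth isolating: it maps district IDs (as character strings) to column positions, so any geographically sorted ID vector gives the permutation needed to reorder the matrix. A toy example with made-up IDs:

```r
m <- matrix(1:9, 3, dimnames = list(c("a", "b", "c"), c("a", "b", "c")))
hash <- setNames(seq_along(colnames(m)), colnames(m))  # "a" -> 1, "b" -> 2, "c" -> 3
south_to_north <- c("c", "a", "b")                     # pretend geographic order
ind <- hash[south_to_north]                            # positions 3, 1, 2
m2 <- m[ind, ind]                                      # rows and columns now in c, a, b order
m2
```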
Let’s now plot the matrix:
opar <- par(pty = "s")
image(log10(apply(t(coloc_mat), 2, rev)), axes = FALSE)
par(opar)
box(bty = "o")
We can verify that the matrix is symmetric and that both Hanoi and Saigon are well connected to the rest of the country. We can also see the districts that are not connected at all.
Let’s say we want to select all the provinces from the Northern EPI, the southernmost province of which is Nghệ An. Let’s retrieve the latitude of the centroids of all the provinces:
tmp <- vn1 %>%
st_centroid() %>%
st_coordinates() %>%
as_tibble() %>%
pull(Y) %>%
mutate(vn1, lat_cent = .) %>%
select(NAME_1, lat_cent)
## Warning in st_centroid.sf(.): st_centroid assumes attributes are constant over
## geometries of x
## Warning in st_centroid.sfc(st_geometry(x), of_largest_polygon =
## of_largest_polygon): st_centroid does not give correct centroids for longitude/
## latitude data
The province south of Nghệ An is Hà Tĩnh, the latitude of the centroid of which is:
threshold <- tmp %>%
filter(NAME_1 == "Hà Tĩnh") %>%
pull(lat_cent)
Now, we retrieve the names of all the provinces that are north of this threshold:
northernEPI <- tmp %>%
filter(lat_cent > threshold) %>%
pull(NAME_1)
Now, we retrieve the corresponding districts’ ID:
sel <- districts %>%
filter(province %in% northernEPI) %>%
pull(polygon_id)
And now we can subset our matrix:
sel <- colnames(coloc_mat) %in% sel
coloc_mat_northernEPI <- coloc_mat[sel, sel]
hash <- setNames(seq_along(colnames(coloc_mat_northernEPI)),
colnames(coloc_mat_northernEPI))
ind <- districts %>%
filter(province %in% northernEPI) %>%
arrange(lat, lon) %>%
pull(polygon_id) %>%
as.character() %>%
magrittr::extract(hash, .)
coloc_mat_northernEPI <- coloc_mat_northernEPI[ind, ind]
Which gives:
opar <- par(pty = "s")
image(log10(apply(t(coloc_mat_northernEPI), 2, rev)), axes = FALSE)
par(opar)
box(bty = "o")
As a sanity check of the Facebook coverage, let's relate the summed link values to the Facebook population per district for the first week:
colocation$`2020-03-03` %>%
select(polygon1_id, link_value, fb_population_1) %>%
group_by(polygon1_id) %>%
summarise(link = sum(link_value),
fb_pop = unique(fb_population_1)) %>%
map(log) %$%
plot(fb_pop, link)
## `summarise()` ungrouping output (override with `.groups` argument)
The following model uses colocation data to formulate the coupling in an SEIR metapopulation model. The equations that describe the epidemic dynamics in each of the populations are given below: \[ \begin{aligned} \frac{dS_i}{dt} & = - S_i \sum_{k} \beta_{ik} \frac{I_k}{N_k} \\ \frac{dE_i}{dt} & = S_i \sum_{k} \beta_{ik} \frac{I_k}{N_k} - \sigma E_i \\ \frac{dI_i}{dt} & = \sigma E_i - \gamma I_i \\ \frac{dR_i}{dt} & = \gamma I_i \\ \end{aligned} \] where \(\frac{1}{\sigma} = 7\) days is the latency period and \(\frac{1}{\gamma} = 7\) days is the recovery period. Note that \(N_i = S_i + E_i + I_i + R_i\). We assume that \(\beta_{ik}\), which represents the transmission rate from population \(k\) to population \(i\), is proportional to \(C_{ik}\), the colocation probability for populations \(i\) and \(k\).
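These equations can be integrated numerically; below is a minimal sketch with the deSolve package on a hypothetical two-patch system (the \(\beta\) matrix and initial conditions are made up, not derived from the colocation data):

```r
library(deSolve)

# dS_i/dt = -S_i * sum_k beta_ik I_k / N_k, etc., vectorized over patches
seir <- function(t, y, p) {
  n <- p$n
  S <- y[1:n]; E <- y[n + 1:n]; I <- y[2 * n + 1:n]; R <- y[3 * n + 1:n]
  N <- S + E + I + R
  foi <- S * (p$beta %*% (I / N))   # force of infection per patch
  dS <- -foi
  dE <- foi - p$sigma * E
  dI <- p$sigma * E - p$gamma * I
  dR <- p$gamma * I
  list(c(dS, dE, dI, dR))
}

p <- list(n = 2, sigma = 1 / 7, gamma = 1 / 7,
          beta = matrix(c(.3, .05, .05, .3), 2))  # made-up beta_ik
y0 <- c(S = c(9999, 10000), E = c(0, 0), I = c(1, 0), R = c(0, 0))
out <- ode(y0, seq(0, 365), seir, p)
```

In the actual model the `beta` matrix would instead be filled with the \(\beta_{ik} = \gamma R_{0ik}\) values derived from the colocation matrix.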
The following parameterization is an adaptation of the contact model given in the paper "Modeling the impact of human mobility and travel restrictions on the potential spread of SARS-CoV-2 in Taiwan".
The \(\beta_{ik}\) are defined as follows: \[ \begin{aligned} \beta_{ik} = \gamma R_{0 ik} \end{aligned} \] where \[ \begin{aligned} R_{0 ik} & = R_{0 \: Hanoi'} \frac{C_{ik}}{C_{Hanoi'-Hanoi'}} \\ \end{aligned} \] and \(Hanoi'\) is the district in Hà Nội with the highest density.
Let’s identify the district of Hà Nội with the highest population density:
HanoiDist <- vn2 %>%
filter(province == "Hà Nội") %>%
arrange(desc(den_km2)) %>%
head(1) %>%
pull(district)
HanoiDistID <- vn2 %>%
filter(province == "Hà Nội") %>%
filter(district == HanoiDist) %>%
pull(polygon_id)
As such, we will set the \(R_0\) of Hoàn Kiếm to 18; that is, \(R_{0 \: Hanoi'} = 18\).
The following function computes the \(R_{0 ik}\) values given a reference district with \(R_0\) equal to `standardR0` and district ID `standardDistID`:
compute_R0 <- function(coloc_mat, standardR0, standardDistID) {
melt(coloc_mat) %>%
setNames(c("polygon1_id", "polygon2_id", "link_value")) %>%
mutate(R0 = standardR0 * link_value / coloc_mat[paste(standardDistID), paste(standardDistID)])
}
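The scaling in `compute_R0()` boils down to a single division; a quick check on a made-up 2 × 2 colocation matrix, taking district "1" as the reference:

```r
cm <- matrix(c(.04, .01, .01, .02), 2, dimnames = list(1:2, 1:2))
standardR0 <- 18
R0_mat <- standardR0 * cm / cm["1", "1"]  # rescale each C_ik by C_ref,ref
R0_mat["1", "1"]  # 18 by construction
R0_mat["2", "2"]  # 18 * .02 / .04 = 9
```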
Let’s compute the \(R_{0 ik}\) with \(R_{0 \: Hanoi'} = 18\):
R0 <- compute_R0(coloc_mat, 18, HanoiDistID)
Let’s see the distribution of the intracity \(R_0\) values; that is, the distribution of \(R_{0 ii}\) values.
intracityR0 <- R0 %>%
filter(polygon1_id == polygon2_id) %>%
rename(intraR0 = R0)
hist2(intracityR0$intraR0, n = 50, xlab = "Intracity R0", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 60, 10))
axis(2)
cb <- RColorBrewer::brewer.pal(9, "YlOrBr")
color_generator <- colorRampPalette(cb)
pal <- color_generator(10)
hist2(intracityR0$intraR0, quantile(intracityR0$intraR0, seq(0, 1, le = 11)), col = pal, axes = FALSE,
xlab = "Intracity R0", ylab = "density of probability")
axis(1, seq(0, 60, 10))
axis(2)
quantile(intracityR0$intraR0, seq(0, 1, le = 11))
## 0% 10% 20% 30% 40% 50% 60% 70%
## 0.000000 1.586063 2.008983 2.373354 2.887077 3.456204 4.453696 6.125704
## 80% 90% 100%
## 8.322006 13.101921 55.396777
Let’s map the intracity \(R_0\) values for the districts:
intracityR0sf <- vn2 %>%
right_join(select(intracityR0, -polygon2_id), c("polygon_id" = "polygon1_id"))
The following districts have intracity \(R_0\) greater than 10.
vn0 %>%
st_geometry() %>%
plot(col = "grey")
intracityR0sf %>%
select("intraR0") %>%
filter(intraR0 > 10) %>%
plot(add = TRUE, col = "red")
The following districts have intracity \(R_0\) greater than 20.
intracityR0sf %>%
filter(intraR0 > 20) %>%
arrange(desc(intraR0)) %>%
st_drop_geometry() %>%
select(province, district, intraR0)
## province district intraR0
## 1 Yên Bái Trạm Tấu 55.39678
## 2 Hà Giang Hoàng Su Phì 42.67776
## 3 Quảng Nam Tây Giang 42.55601
## 4 Quảng Ngãi Tây Trà 41.49006
## 5 Kon Tum Kon Plông 33.77775
## 6 Thanh Hóa Mường Lát 32.71860
## 7 Sơn La Sốp Cộp 32.20555
## 8 Quảng Nam Phước Sơn 31.46382
## 9 Hà Giang Xín Mần 31.07934
## 10 Hà Giang Đồng Văn 29.85194
## 11 Quảng Ngãi Lý Sơn 29.61061
## 12 Hà Giang Mèo Vạc 28.03566
## 13 Cao Bằng Bảo Lạc 27.90237
## 14 Quảng Nam Nam Trà My 26.41899
## 15 Lạng Sơn Đình Lập 26.37892
## 16 Cao Bằng Bảo Lâm 24.83981
## 17 Sơn La Bắc Yên 24.52896
## 18 Quảng Ninh Ba Chẽ 23.66076
## 19 Quảng Ninh Bình Liêu 23.55896
## 20 Điện Biên Tủa Chùa 23.15901
## 21 Cao Bằng Thông Nông 23.08779
## 22 Điện Biên Điện Biên Đông 22.44800
## 23 Bình Thuận Phú Quí 22.33890
## 24 Hồ Chí Minh 4 22.28844
## 25 Gia Lai Ayun Pa 22.03421
## 26 Hà Giang Quản Bạ 20.98567
## 27 Hồ Chí Minh 3 20.84098
## 28 Lào Cai Si Ma Cai 20.67039
## 29 Quảng Trị Quảng Trị 20.22414
vn0 %>%
st_geometry() %>%
plot(col = "grey")
intracityR0sf %>%
filter(intraR0 > 20) %>%
select(intraR0) %>%
plot(add = TRUE, col = "red")
It appears that remote regions of Vietnam have relatively higher intracity \(R_0\) values. Let’s verify this by graphing the relationship between population density and intracity \(R_0\) values, under the assumption that more remote regions have lower population densities.
ggplot(intracityR0sf, aes(x = log10(den_km2), y = intraR0)) + geom_point()
Both populations with relatively low population densities and populations with relatively high population densities have high intracity \(R_0\) values, and the populations with especially low densities have the greatest ones. We would instead expect a positive relationship between population density and intracity \(R_0\). The intracity \(R_0\) values therefore need to be adjusted to correct this artifact of how the colocation matrix is constructed.
Let’s consider the intercity R0 values:
intercityR0 <- R0 %>%
filter(polygon1_id != polygon2_id) %>%
rename(interR0 = R0)
totalInter <- intercityR0 %>%
group_by(polygon1_id) %>%
summarize(interR0 = sum(interR0)) %>%
arrange(desc(interR0))
## `summarise()` ungrouping output (override with `.groups` argument)
Let’s map the intercity \(R_0\) values for the districts. The following districts have intercity \(R_0\) values greater than 1:
intercityR0sf <- vn2 %>%
right_join(totalInter, c("polygon_id" = "polygon1_id"))
vn0 %>%
st_geometry() %>%
plot(col = "grey")
intercityR0sf %>%
select("interR0") %>%
filter(interR0 > 1) %>%
plot(add = TRUE, col = "red")
Let’s graph the relationship between population density and intercity \(R_0\) values:
ggplot(intercityR0sf, aes(x = log10(den_km2), y = interR0)) + geom_point()
As we would expect, population centers with higher population densities are better connected with surrounding regions, resulting in higher intercity \(R_0\) values. The intercity \(R_0\) values under this model are thus consistent with epidemiological expectations.
Let’s graph the relationship between intracity \(R_0\) values and intercity \(R_0\) values:
tmp <- intercityR0sf %>%
right_join(select(intracityR0, -polygon2_id, -link_value), c("polygon_id" = "polygon1_id"))
ggplot(tmp, aes(x = intraR0, y = interR0)) + geom_point() +
geom_abline(slope = 1, intercept = 0)
There are two distinct groups in the plot: one group has larger intercity \(R_0\) values relative to intracity \(R_0\) values, while the other has larger intracity \(R_0\) values relative to intercity \(R_0\) values.
Let’s use a linear mixture model to assign each point to a particular linear model:
library(flexmix)
## Loading required package: lattice
model <- flexmix(interR0 ~ intraR0, tmp, k = 2)
summary(model)
##
## Call:
## flexmix(formula = interR0 ~ intraR0, data = tmp, k = 2)
##
## prior size post>0 ratio
## Comp.1 0.219 124 666 0.186
## Comp.2 0.781 574 643 0.893
##
## 'log Lik.' -498.521 (df=7)
## AIC: 1011.042 BIC: 1042.88
plot(tmp$interR0, tmp$intraR0, col = clusters(model),
xlab = "intercity R0", ylab = "intracity R0")
Let’s map the clustering observed in the linear mixture model to the districts:
vn0 %>%
st_geometry() %>%
plot(col = "grey")
colors <- clusters(model)
colors[colors == 1] <- 3
tmp %>%
transmute(cluster = clusters(model)) %>%
st_geometry() %>%
plot(add = TRUE, col = colors)
It is a little noisy.
One model commonly used in geography is the so-called gravity model, which states that the connection between any two locations \(i\) and \(j\) is:
\[ C_{ij} = \theta\frac{P_i^{\tau_1}P_j^{\tau_2}}{d_{ij}^\rho} \] See for example doi:10.1126/science.1125237.
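As a starting point for such a test, the gravity model can be log-linearized and fitted with `lm()`; here is a sketch on simulated data (all values are made up, this is not the Facebook data):

```r
set.seed(42)
n <- 200
P_i  <- runif(n, 1e4, 1e6)  # population of origin
P_j  <- runif(n, 1e4, 1e6)  # population of destination
d_ij <- runif(n, 10, 500)   # distance

# Simulate C_ij from the gravity model with tau1 = .8, tau2 = .7, rho = 1.5
C_ij <- 1e-6 * P_i^.8 * P_j^.7 / d_ij^1.5 * exp(rnorm(n, sd = .1))

# log C_ij = log theta + tau1 log P_i + tau2 log P_j - rho log d_ij
fit <- lm(log(C_ij) ~ log(P_i) + log(P_j) + log(d_ij))
coef(fit)  # estimates should be close to .8, .7 and -1.5
```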
TO DO: test this model on the Facebook data.